This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.
● ScheduledDay tells us on what day the patient set up their appointment.
● Neighborhood indicates the location of the hospital.
● Scholarship indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.
● Be careful about the encoding of the last column: it says ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up.
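Because the last column is phrased negatively, it is easy to read it backwards. A minimal sketch on made-up rows shows one way to derive a positively phrased flag; the `attended` column name is an assumption for illustration and is not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for the real file; only the 'No-show' column matters here.
toy = pd.DataFrame({'No-show': ['No', 'Yes', 'No', 'No']})

# 'No' in 'No-show' means the patient DID attend, so compare against 'No'.
# 'attended' is a hypothetical helper column, not a column of the dataset.
toy['attended'] = toy['No-show'] == 'No'

# The mean of a boolean column is the share of True values.
attendance_rate = toy['attended'].mean()
print(toy)
print(attendance_rate)
```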
A CSV file contains the data we will analyze.
I will answer the following questions:
Q1. Does age affect attendance?
Q2. Do age and chronic diseases affect attendance?
Q3. Are SMS notifications associated with lower incidences of no-shows?
Q4. Does neighbourhood affect attendance?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
plt.rcParams['font.size']=14
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()
#exploring the shape of the data
df.shape
The data contains 110527 rows and 14 columns.
# columns data types
df.dtypes
# check for duplicated rows
df.duplicated().sum()
# check for missing values
df.isnull().sum()
There aren't any missing values or duplicated data in any column.
# check for unique values
df['PatientId'].nunique()
There are 62,299 unique patient IDs, which means that some patients had more than one appointment.
#check the number of duplicated ids.
df['PatientId'].duplicated().sum()
48228 duplicated IDs.
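The two counts above are linked: `duplicated()` flags every occurrence of an ID after the first one, so unique IDs plus duplicated IDs should equal the number of rows (here 62,299 + 48,228 = 110,527). The sketch below checks that relationship on a small made-up frame rather than on the real file.

```python
import pandas as pd

# Made-up IDs: patient 2 appears twice, patient 3 appears three times.
toy = pd.DataFrame({'PatientId': [1, 2, 2, 3, 3, 3]})

n_rows = len(toy)
n_unique = toy['PatientId'].nunique()
# duplicated() marks each repeat after the first occurrence of an ID.
n_repeats = toy['PatientId'].duplicated().sum()

print(n_rows, n_unique, n_repeats)
# Every row is either a first occurrence or a repeat.
assert n_rows == n_unique + n_repeats
```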
#check duplicated status of showing or not of the same ids.
df.duplicated(['No-show','PatientId']).sum()
There are 38710 rows that repeat the same patient ID and show/no-show status as an earlier row; we will remove them in the data cleaning process.
df.describe()
The mean age is 37, the maximum age is 115, and the minimum age is -1 (a data entry error); 50% of ages are between 18 and 55 years old.
Most patients are not handicapped.
#identifying the row index with -1 age value.
# Age is numeric, so compare against the integer -1 rather than the string "-1"
m = df.query('Age == -1')
m
Row number 99832 has an invalid age, so we will remove it.
# removing the -1 age row
df.drop(index=99832 , inplace=True)
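Dropping by a hard-coded index works for this particular file, but it depends on the row order. As an alternative sketch, the invalid rows can be filtered out by condition. The frame below is made up, and keeping age 0 (assumed to be infants) is an assumption, not something the dataset documentation states.

```python
import pandas as pd

# Made-up ages with one invalid entry; the index values are arbitrary.
toy = pd.DataFrame({'Age': [34, -1, 0, 115]}, index=[10, 11, 12, 13])

# Keep only rows with a non-negative age. Age 0 is kept on the
# assumption that it represents infants rather than a data error.
valid = toy[toy['Age'] >= 0]

print(valid)
```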
Correction of column names
# fix the misspelled column names and replace the hyphen in 'No-show'
df.rename(columns={'No-show': 'No_show', 'Handcap': 'Handicap', 'Hipertension': 'Hypertension'}, inplace=True)
df.head()
Drop columns
df.drop(['PatientId', 'AppointmentID', 'AppointmentDay', 'ScheduledDay'], axis=1, inplace=True)
df.head()
Differentiating between patients with and without a handicap
# if the value is greater than 1 change it to 1, otherwise keep it
df['Handicap'] = np.where(df['Handicap'] > 1, 1, df['Handicap'])
# confirm
df.Handicap.value_counts()
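The `np.where` step collapses every value above 1 into 1. A short sketch on made-up values shows that `Series.clip(upper=1)` gives the same result for non-negative integers, which may read more directly; the sample values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up handicap levels; anything above 1 should become 1.
handicap = pd.Series([0, 1, 2, 4, 0])

# Approach used in the notebook: replace values > 1 with 1.
via_where = pd.Series(np.where(handicap > 1, 1, handicap))

# Equivalent for non-negative values: cap everything at 1.
via_clip = handicap.clip(upper=1)

print(via_where.tolist())
print(via_clip.tolist())
```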
General Look
df.hist(figsize=(20,8));
Deep Look on show status
df['BinNoShow'] = (df.No_show == "Yes").astype(int)
df.No_show.value_counts().plot.pie(figsize=(6,6), autopct='%.2f%%', explode=(0, .05))
plt.show()
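def label(x, y, t, z):
    # helper: set the title (x), x-tick positions (y), x label (t) and y label (z), then show the plot
    plt.title(x)
    plt.xticks(y)
    plt.xlabel(t)
    plt.ylabel(z)
    plt.show()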
Deep Look on Patients Gender
sns.countplot(data = df, x = 'Gender', color = "red")
label('Patients Gender',[0,1], '','Number of Patients')
Deep Look on Patients Scholarship Status
sns.countplot(data = df, x ='Scholarship', color = "red")
label('Patients Scholarship Status',[0,1], '','Number of Patients')
Most of them have no scholarship.
Deep Look on Patients Alcoholism Status
sns.countplot(data = df, x ='Alcoholism', color = "red")
label('Patients Alcoholism Status',[0,1], '','Number of Patients')
Most of them do not suffer from alcoholism.
Deep Look on Patients Hypertension Status
sns.countplot(data = df, x ='Hypertension', color = "red")
label('Patients Hypertension Status',[0,1], '','Number of Patients')
Most of them do not suffer from hypertension.
Deep Look on Patients Diabetes Status
sns.countplot(data = df, x ='Diabetes', color = "red")
label('Patients Diabetes Status',[0,1], '','Number of Patients')
Most of them don't have diabetes.
Deep Look on Patients Handicap Status
sns.countplot(data = df, x ='Handicap', color = "red")
label('Patients Handicap Status',[0,1], '','Number of Patients')
Most of them don't have a handicap.
Deep Look on Patients SMS_received Status
sns.countplot(data = df, x ='SMS_received', color = "red")
label('Patients SMS_received Status',[0,1], '','Number of Patients')
Most of them did not receive an SMS message.
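Statements about what share of patients fall into each category are easier to check with numbers than by eye from a count plot. The sketch below, on a made-up frame, uses `value_counts(normalize=True)` and the mean of a 0/1 column to get those shares; the values are illustrative and do not reproduce the dataset.

```python
import pandas as pd

# Made-up 0/1 flags for four patients.
toy = pd.DataFrame({
    'SMS_received': [0, 0, 1, 0],
    'Diabetes': [0, 0, 0, 1],
})

# Share of each category per column (fractions that sum to 1).
shares = {col: toy[col].value_counts(normalize=True) for col in toy.columns}

# For a 0/1 column, the mean is the share of rows equal to 1.
share_with_sms = toy['SMS_received'].mean()

print(shares['SMS_received'])
print(share_with_sms)
```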
Age distribution
# 5-year bins from 0 up to the maximum age
bin_edges = np.arange(0, df['Age'].max()+5, 5)
plt.hist(data = df, x = 'Age', bins = bin_edges)
plt.xlabel('Age')
plt.title('Distribution of Patients Age');
Most of the patients are young or middle-aged.
#Patients are divided into two groups according to attendance
show=df.No_show=='No'
noshow=df.No_show=='Yes'
df[show].count(),df[noshow].count()
def attendance(df, col_name, attended, absent):
    # overlay histograms of one column for patients who showed up and those who did not
    plt.figure(figsize=[15,7])
    df[col_name][attended].plot(kind='hist',alpha=0.9,bins=15, color='green' , label='show')
    df[col_name][absent].plot(kind='hist',alpha=0.9,bins=15, color='darkviolet' , label='noshow')
    plt.legend();
    plt.title('Comparison according to age')
    plt.ylabel('Patients Number')
    plt.xlabel('Age');
attendance(df, 'Age', show, noshow)
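Overlaid histograms show counts, so large age groups dominate the picture. To address Q1 more directly, one option is to compute the no-show rate per age group. The sketch below does this on made-up rows; the bin edges, the group labels and the `missed` column are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up patients; 'Yes' in No_show means the appointment was missed.
toy = pd.DataFrame({
    'Age': [5, 17, 25, 40, 63, 80],
    'No_show': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
})
toy['missed'] = (toy['No_show'] == 'Yes').astype(int)

# Hypothetical age bands; right=False makes each bin [lower, upper).
bins = [0, 18, 40, 65, 120]
labels = ['0-17', '18-39', '40-64', '65+']
toy['age_group'] = pd.cut(toy['Age'], bins=bins, labels=labels, right=False)

# Mean of the 0/1 flag = share of missed appointments per age band.
rate_by_group = toy.groupby('age_group', observed=True)['missed'].mean()
print(rate_by_group)
```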
plt.figure(figsize=[18,5])
# select 'Age' before taking the mean so non-numeric columns are not averaged
df[show].groupby(['Hypertension', 'Diabetes' ])['Age'].mean().plot(kind='bar', color='green', label='show')
df[noshow].groupby(['Hypertension', 'Diabetes' ])['Age'].mean().plot(kind='bar',color='darkviolet', label='noshow')
plt.legend();
plt.title('Comparison of chronic disease and age')
plt.ylabel('MEAN AGE')
plt.xlabel('CHRONIC DISEASES');
yes_or_no = {1:'Yes', 0:'No'}
# bar plot the proportion of no-shows for each SMS condition
ax = sns.barplot(x=df.SMS_received.map(yes_or_no), y=df.BinNoShow)
ax.set_ylabel('Proportion of No Shows');
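The bar plot can be complemented with the underlying numbers. The sketch below, on made-up rows, computes the no-show rate and the group size for each SMS condition; the `missed` column is a hypothetical helper equivalent to `BinNoShow`.

```python
import pandas as pd

# Made-up appointments: four without an SMS, two with one.
toy = pd.DataFrame({
    'SMS_received': [0, 0, 0, 0, 1, 1],
    'No_show': ['No', 'No', 'No', 'Yes', 'Yes', 'No'],
})
toy['missed'] = (toy['No_show'] == 'Yes').astype(int)

# 'mean' is the no-show rate; 'count' shows how many rows back each rate.
summary = toy.groupby('SMS_received')['missed'].agg(['mean', 'count'])
print(summary)
```

Reporting the count next to the rate makes it easier to judge whether the two groups are large enough to compare.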
df.Neighbourhood[show].value_counts().plot(kind='pie', figsize=(100,50),autopct='%1.2f%%',label='show')
plt.ylabel('')
plt.legend(fontsize=9)
plt.title('showing neighbourhood',fontsize=50);
df.Neighbourhood[noshow].value_counts().plot(kind='pie', figsize=(100,50),autopct='%1.2f%%',label='noshow')
plt.ylabel('')
plt.legend(fontsize=9)
plt.title('noshowing neighbourhood',fontsize=50);
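Pie charts of raw counts mostly reflect how many appointments each neighbourhood has. For Q4, a rate per neighbourhood may be more telling. The sketch below ranks made-up neighbourhoods by no-show rate and skips those with very few appointments; the `min_appointments` threshold and the neighbourhood names are assumptions for illustration.

```python
import pandas as pd

# Made-up appointments in three neighbourhoods.
toy = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'No_show': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes'],
})
toy['missed'] = (toy['No_show'] == 'Yes').astype(int)

# No-show rate and number of appointments per neighbourhood.
stats = toy.groupby('Neighbourhood')['missed'].agg(rate='mean', n='count')

# Hypothetical cutoff: ignore neighbourhoods with too few appointments,
# since a rate based on one or two rows is not informative.
min_appointments = 2
ranked = stats[stats['n'] >= min_appointments].sort_values('rate', ascending=False)
print(ranked)
```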
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])